
Hierarchical Clustering

Data Preprocessing

# Import the dataset and keep only the annual income and spending score columns
dataset = read.csv('Mall_Customers.csv')
X = dataset[4:5]
head(X, 10)
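
As a quick sanity check before clustering, we can look at the structure and ranges of the two selected features. This is an optional sketch; it assumes the standard Mall_Customers.csv layout, where columns 4 and 5 hold the annual income (k$) and the spending score (1-100).

str(X)      # types of the two selected columns
summary(X)  # ranges of annual income and spending score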

Using the dendrogram to find the optimal number of clusters

# Ward's method ('ward.D') is used to minimise the variance within each cluster
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
plot(dendrogram, main = 'Dendrogram', xlab = 'Customers', ylab = 'Euclidean Distance')
[Output: dendrogram of the customers]

From the dendrogram above, the longest vertical line that is not crossed by any horizontal line indicates the optimal cut, which gives 5 clusters.
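
This visual reading can also be checked numerically: the merge heights of the tree are stored in dendrogram$height, and a long uncrossed vertical line corresponds to a large gap between consecutive merge heights. The sketch below is optional, and the variable names (heights, gaps) are just illustrative.

heights = sort(dendrogram$height, decreasing = TRUE)
gaps = -diff(heights)  # gap below each merge, from the top of the tree downwards
head(gaps, 10)         # a large gap after the k-th merge suggests k + 1 clusters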


Fitting hierarchical clustering to the Mall dataset

# Build the tree with the same settings and cut it into 5 clusters
hc = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
y_hc = cutree(hc, 5)
head(y_hc, 10)
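
To see how large each segment is, and to keep the assignments next to the raw data, we can tabulate the labels and append them as a new column. This is a small optional sketch; the column name cluster is just an illustrative choice.

table(y_hc)             # number of customers per cluster
dataset$cluster = y_hc  # attach the cluster labels to the original data
head(dataset, 10)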

Visualising the clusters

library(cluster)
clusplot(X, y_hc,
         lines = 0, shade = TRUE, color = TRUE,
         labels = 5, plotchar = FALSE, span = TRUE,
         main = 'Clusters of customers',
         xlab = 'Annual Income', ylab = 'Spending Score')
# for more options see help(clusplot.default)
[Output: cluster plot of the customers]

The target customers are those with a high annual income and a high spending score. In this plot, the data points in cluster 1 are those high-income, high-spending customers.
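
As an optional follow-up, the target segment can be pulled out directly from the labels. This sketch assumes, as read off the plot above, that cluster 1 is the high-income, high-spending group; cutree numbers clusters arbitrarily, so this should be confirmed against the plot before relying on it.

target_customers = dataset[y_hc == 1, ]  # assumed target segment (verify against the plot)
head(target_customers, 10)
summary(target_customers[4:5])           # income and spending score of the target group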